Based on the Data Carpentry curriculum for Data Visualization in Python (© Data Carpentry, under a Creative Commons Attribution license)
Python is a general-purpose programming language that supports rapid development of scripts and applications.
Python’s main advantages:
The following instructions will show you how to download and open this Jupyter notebook interactively, with the notes and exercises for the introductory workshop.
- Unzip the notebook folder once it has downloaded
- Download Anaconda from https://www.anaconda.com/download/ (select your operating system, then click 'Download')
- Install Anaconda, then open JupyterLab (click 'Launch' in Anaconda Navigator)
- In JupyterLab, navigate to where you downloaded the notebook folder
- Double click on "Basic Pandas.ipynb"
- The notebook should now load!
This Jupyter notebook has been built to run on a Python 3 'kernel'. That means you can enter Python code into the code cells, and it will execute as Python.
To execute code, select the cell(s), then either:
* Choose "Run Selected Cells" from the Command tab on the left
* Click the run button (triangle) at the top of the notebook
* Press "Shift + Enter"
Try to execute the code in the cell below:
In [ ]:
# Like this first line, anything following a hash character (for the rest of that line) is considered a comment, and won't be run as code
text_str = "Congratulations, you've just run some Python code!"
print(text_str)
You will notice that the output of the cell is placed immediately underneath the cell, and that a number appears to the left of the cell to indicate the order in which the cell was run:
In [1]:
You can modify and execute each cell in this notebook as often as you wish, in any order that you wish. Any variables you create in one cell are available for use in another. For example:
In [ ]:
print(text_str)
In [ ]:
text = "Weill Cornell Medicine" # An example of a string
number = 42 # An example of an integer
pi_value = 3.1415 # An example of a float
Here we've assigned data to the variables text, number and pi_value, using the assignment operator =. To review the value of a variable, we can type the name of the variable into the cell and press Shift + Enter:
In [ ]:
text
Everything in Python has a type. To get the type of something, we can pass it to the built-in function type:
In [ ]:
type(text)
In [ ]:
type(number)
In [ ]:
type(6.02)
The variable text is of type str, short for "string". Strings hold sequences of characters, which can be letters, numbers, punctuation or more exotic forms of text (even emoji!).
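Strings also support a handful of handy operations. The cell below is an illustrative sketch that is not part of the original notebook; it reuses the text variable defined earlier:
In [ ]:
# Strings can be joined with + and indexed with square brackets (0-based)
greeting = "Hello, " + text
print(greeting[0])    # 'H', the first character
print(len(greeting))  # the number of characters in the string
Python can also be used as a calculator, using the standard arithmetic operators: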
In [ ]:
2 + 2 # Addition
In [ ]:
6 * 7 # Multiplication
In [ ]:
2 ** 16 # Power
In [ ]:
13 % 5 # Modulo
We can also use the comparison operators <, >, ==, !=, <=, >=, as well as the logical operators and, or, not. These expressions return a value of type bool, called a boolean, which is either True or False.
In [ ]:
3 > 4
In [ ]:
True and True
In [ ]:
True or False
In [ ]:
# Lists hold an ordered sequence of elements; indexing starts at 0
numbers = [1, 2, 3]
numbers[0]  # the first element
A for loop can be used to access the elements in a list or other Python data structure one at a time:
In [ ]:
for num in numbers:
    print(num)
Indentation is very important in Python. Note that the second line in the example above is indented. You can set an indent either by pressing the tab key, or by entering four spaces (the usual Python convention).
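Indentation is how Python knows which lines belong to a block. As a quick illustrative sketch (not part of the original notebook), the same rule applies to conditional statements:
In [ ]:
# The indented lines belong to the if/else branches
for num in numbers:
    if num > 1:
        print(num, "is greater than 1")
    else:
        print(num, "is not greater than 1")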
To find out what methods are available for an object in JupyterLab, press Shift + Tab when the cursor is at the end of the object (or inside the parentheses, for a function).
In [ ]:
# Tuples use parentheses
a_tuple = (1, 2, 3)
another_tuple = ('blue', 'green', 'red')
# Note: lists use square brackets
a_list = [1, 2, 3]
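The key difference between the two is that lists can be changed after they are created, whereas tuples cannot. A minimal sketch (not part of the original challenge cells below) of that difference:
In [ ]:
# Lists are mutable: we can replace an element in place
a_list[0] = 5
print(a_list)        # [5, 2, 3]

# Tuples are immutable: the line below would raise a TypeError if uncommented
# a_tuple[0] = 5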
In [ ]:
# 1:
In [ ]:
# 2:
In [ ]:
translation = {'one': 1, 'two': 2}
translation['one']
Dictionaries work a lot like lists - except that you index them with keys. You can think of a key as a name or unique identifier for a value in the dictionary. Keys can only be of particular types - they have to be "hashable". Strings and numeric types are acceptable, but lists aren't.
In [ ]:
rev = {1: 'one', 2: 'two'}
rev[1]
In [ ]:
# This raises a TypeError: lists are not hashable, so they cannot be used as keys
bad = {[1, 2, 3]: 3}
In Python, a "Traceback" is a multi-line error block printed out for the user.
To add an item to the dictionary we assign a value to a new key:
In [ ]:
rev = {1: 'one', 2: 'two'}
rev[3] = 'three'
rev
Using for loops with dictionaries is a little more complicated. We can do this in two ways:
In [ ]:
for key, value in rev.items():
    print(key, '->', value)
or
In [ ]:
for key in rev.keys():
    print(key, '->', rev[key])
Challenge - Can you do reassignment in a dictionary?
- First create a dictionary challenge_dict with the following key:value pairs:
  - 1 : 'uno'
  - 2 : 'dos'
  - 3 : 'tres'
- Display the keys and values in the dictionary
- Try to reassign the second value (in the key:value pair) so that it no longer reads "dos" but instead reads "apple-sauce"
- Now display your dictionary again to see if it has changed
In [ ]:
# 1
challenge_dict =
In [ ]:
# 2
In [ ]:
# 3
In [ ]:
# 4
It is important to note that, in older versions of Python (before 3.7), dictionaries were "unordered" and did not remember the sequence in which key:value pairs were added, so the order in which items came back from a loop could appear random. From Python 3.7 onwards, dictionaries preserve insertion order, but it is still best to think of them as lookups by key rather than as ordered sequences.
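A quick sketch (not part of the original notebook) illustrating this behaviour on Python 3.7 or later:
In [ ]:
# On Python 3.7+, keys come back in the order they were inserted
d = {'b': 2, 'a': 1}
d['c'] = 3       # new keys are appended at the end
list(d.keys())   # ['b', 'a', 'c']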
We can automate the process of performing data manipulations in Python. It's efficient to spend time building the code to perform these tasks because once it's built, we can use it over and over on different datasets that use a similar format. This makes our methods easily reproducible. We can also easily share our code with colleagues and they can replicate the same analysis.
For this lesson, we will be using NCD Risk Factor Collaboration (NCD-RisC) data, from Worldwide trends in body-mass index, underweight, overweight, and obesity from 1975 to 2016: a pooled analysis of 2416 population-based measurement studies in 128.9 million children, adolescents, and adults.
Lancet 2017, published online 11 October 2017
We will be using the country-specific adult BMI data. The downloaded filename is NCD_RisC_Lancet_2017_BMI_age_standardised_country.csv, but we have saved a simplified version of the file here as NCD_RisC_bmi.csv.
We are studying the body-mass index (BMI) of individuals from 200 countries over a number of years. The dataset is stored as a .csv file: each row holds information for a single measurement period, and the columns represent:
Column | Description |
---|---|
country | Country |
iso | Country code |
sex | Gender of the group |
year | Year measurements were taken |
bmi | Mean BMI |
lower95ci | Lower bound of the 95% confidence interval for mean BMI |
upper95ci | Upper bound of the 95% confidence interval for mean BMI |
obese_pc | Prevalence of obesity (BMI > 30) |
obese_lower95ci | Lower bound of the 95% confidence interval for obesity prevalence |
obese_upper95ci | Upper bound of the 95% confidence interval for obesity prevalence |
The first few rows of our first file look like this:
country,iso,sex,year,bmi,lower95ci,upper95ci,obese_pc,obese_lower95ci,obese_upper95ci
Afghanistan,AFG,Men,1975,18.9994401470858,16.3311193806968,21.7630768066895,0.00207754471275783,0.000383525249039564,0.00651548709746884
Afghanistan,AFG,Men,1976,19.1055182251737,16.5000772402149,21.8011089747524,0.00223682902494332,0.000437170207444467,0.0068203531416254105
Afghanistan,AFG,Men,1977,19.2120802193391,16.6465068023822,21.8420206008889,0.0024101143094182,0.000497358014754166,0.00717898800855567
Afghanistan,AFG,Men,1978,19.3190525301356,16.7868986048326,21.8765159552146,0.00259867365703349,0.000558764696829654,0.007567109958970071
A library in Python contains a set of tools (called functions) that perform tasks on our data. Importing a library is like getting a piece of lab equipment out of a storage locker and setting it up on the bench for use in a project. Once a library is set up, it can be used or called to perform many tasks.
One of the best options for working with tabular data in Python is to use the Python Data Analysis Library (a.k.a. Pandas). The Pandas library provides data structures, produces high quality plots with matplotlib and integrates nicely with other libraries that use NumPy (which is another Python library) arrays.
Python doesn't load all of the libraries available to it by default. We have to add an import statement to our code in order to use library functions. To import a library, we use the syntax import libraryName. If we want to give the library a nickname to shorten the command, we can add as nickNameHere. An example of importing the pandas library using the common nickname pd is below.
In [ ]:
import pandas as pd
Each time we call a function that's in a library, we use the syntax LibraryName.FunctionName. Adding the library name with a . before the function name tells Python where to find the function. In the example above, we have imported Pandas as pd. This means we don't have to type out pandas each time we call a Pandas function.
We will begin by locating and reading our survey data, which are in CSV format. CSV stands for Comma-Separated Values and is a common way to store formatted data. Other separators may also be used, so you might see tab-separated, colon-separated or space-separated files. It is quite easy to replace one separator with another to match your application. The first line in the file often has headers to explain what is in each column. CSV (and other separator) formats make it easy to share data, and can be imported and exported from many applications, including Microsoft Excel. For more details on CSV files, see the Data Organisation in Spreadsheets lesson.
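For instance, here is a minimal sketch (the tab-separated filename is hypothetical and not part of this lesson's data) of telling Pandas which separator to expect:
In [ ]:
# sep="\t" tells read_csv to split columns on tabs instead of commas
# "data/example_data.tsv" is a hypothetical filename, used only for illustration
pd.read_csv("data/example_data.tsv", sep="\t")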
We can use Pandas' read_csv function to pull the file directly into a DataFrame.
A DataFrame is a 2-dimensional data structure that can store data of different types (including characters, integers, floating point values, factors and more) in columns. It is similar to a spreadsheet, an SQL table, or the data.frame in R. A DataFrame always has an index (0-based). An index refers to the position of an element in the data structure.
In [ ]:
# Note that pd.read_csv is used because we imported pandas as pd
pd.read_csv("data/NCD_RisC_bmi.csv")
We can see that there were 16,800 rows parsed. Each row has 10 columns. The left-most column displayed is the index of the DataFrame. The index is used to identify the position of the data, but it is not an actual column of the DataFrame.
It looks like the read_csv function in Pandas read our file properly. However, we haven't saved any data to memory, so we can't work with it yet. We need to assign the DataFrame to a variable. Remember that a variable is a name for a value, such as x or data. We can create a new object with a variable name by assigning a value to it using =.
Let's call the imported survey data df_surveys:
In [ ]:
df_surveys = pd.read_csv("data/NCD_RisC_bmi.csv")
Notice that when you assign the imported DataFrame to a variable, Python does not produce any output on the screen. We can view the value of the df_surveys object by typing its name into a code cell and running it.
In [ ]:
df_surveys
This prints the contents of the DataFrame, just like above.
Note: if the output is too wide to print in your window, pandas will truncate the display, showing only the first and last few rows and columns, with ellipses (...) in between.
Never fear, all the data is still there. Selecting just a few rows, so it is easier to fit on one window, you can see that pandas has neatly formatted the data to fit our screen:
In [ ]:
df_surveys.head() # The head() method displays the first few rows of the DataFrame. It is discussed below.
In [ ]:
type(df_surveys)
As expected, it's a DataFrame (or, to use the full name that Python uses to refer to it internally, a pandas.core.frame.DataFrame).
What kind of things does df_surveys contain? DataFrames have an attribute called dtypes that answers this:
In [ ]:
df_surveys.dtypes
All the values in a column have the same type. For example, year has type int64, which is a kind of integer. Cells in the year column cannot have fractional values, but the bmi and obese_pc columns can, because they have type float64. The object type doesn't have a very helpful name, but in this case it represents strings (such as 'Men' and 'Women' in the case of sex).
We'll talk a bit more about what the different formats mean in a different lesson.
There are many ways to summarize and access the data stored in DataFrames, using attributes and methods provided by the DataFrame object.
To access an attribute, use the DataFrame object name followed by the attribute name: df_object.attribute. Using the DataFrame df_surveys and the attribute columns, an index of all the column names in the DataFrame can be accessed with df_surveys.columns.
In [ ]:
df_surveys.columns
Methods are called in a similar fashion, using the syntax df_object.method(). As an example, df_surveys.head() gets the first few rows in the DataFrame df_surveys using the head() method. With a method, we can supply extra information in the parentheses to control behaviour.
Let's look at the data using these.
Challenge - DataFrames
Using our DataFrame df_surveys, try out the attributes & methods below to see what they return.
- df_surveys.columns
- df_surveys.shape. Take note of the output of shape: what format does it return the shape of the DataFrame in? HINT: think back to tuples, introduced above.
- df_surveys.head(). Also, what does df_surveys.head(15) do?
- df_surveys.tail()
We've read our data into Python. Next, let's perform some quick summary statistics to learn more about the data that we're working with. We can perform summary stats quickly using groups. But first we need to figure out what we want to group by.
Let's begin by exploring our data:
In [ ]:
# Look at the column names
df_surveys.columns
Let's get a list of all the years. The pd.unique function tells us all of the unique values in the year column.
In [ ]:
pd.unique(df_surveys['year'])
We can get summary statistics for a single column, such as bmi, using the describe method:
In [ ]:
df_surveys['bmi'].describe()
We can also extract one specific metric if we wish:
In [ ]:
df_surveys['bmi'].min()
In [ ]:
df_surveys['bmi'].max()
In [ ]:
df_surveys['bmi'].mean()
In [ ]:
df_surveys['bmi'].std()
In [ ]:
df_surveys['bmi'].count()
But if we want to summarize by one or more variables, for example sex, we can use Pandas' .groupby method. Once we've created a groupby DataFrame, we can quickly calculate summary statistics by a group of our choice.
In [ ]:
# Group data by sex
grouped_data = df_surveys.groupby('sex')
The pandas function describe will return descriptive stats including: mean, median, max, min, std and count for a particular column in the data. Pandas' describe function will only return summary values for columns containing numeric data.
In [ ]:
# Summary statistics for all numeric columns by sex
grouped_data.describe()
In [ ]:
# Provide the mean for each numeric column by sex
# numeric_only=True skips non-numeric columns such as country and iso
grouped_data.mean(numeric_only=True)
The groupby command is powerful in that it allows us to quickly generate summary stats.
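For instance, an illustrative sketch (not part of the original lesson) computing several statistics at once for the bmi column with the .agg() method:
In [ ]:
# Calculate the mean, minimum and maximum bmi for each sex in one step
grouped_data['bmi'].agg(['mean', 'min', 'max'])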
Challenge - Summary Data
- How many recorded individuals are women, and how many are men?
- What happens when you group by two columns using the following syntax and then grab mean values?
grouped_data2 = df_surveys.groupby(['country','sex'])
grouped_data2.mean(numeric_only=True)
- Summarize bmi values for each country in your data. HINT: you can use the following syntax to only create summary statistics for one column in your data:
by_country['bmi'].describe()
In [ ]:
# Count the number of rows (years of measurements) for each country
year_counts = df_surveys.groupby('country')['year'].count()
print(year_counts)
Or, we can count just the rows for the United States of America:
In [ ]:
df_surveys.groupby('country')['year'].count()['United States of America']
We can also perform quick math on a column; the operation is applied element-wise to every row:
In [ ]:
# Multiply all bmi values by 2 (this creates a new Series; df_surveys itself is unchanged)
df_surveys_doubled = df_surveys['bmi'] * 2
In [ ]:
# Make sure figures appear inline in Jupyter Notebook
%matplotlib inline
In [ ]:
# Create a quick bar chart of mean bmi per year
year_bmi = df_surveys.groupby("year")["bmi"].mean()
year_bmi.plot(kind='bar');
We can also look at how many countries were measured in each year:
In [ ]:
total_count = df_surveys.groupby('year')['country'].nunique()
# Let's plot that too
total_count.plot(kind='bar');
(Perhaps a boring graph, but a good way to check that each year is consistent)
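As an aside (an illustrative sketch, not part of the original notebook), the appearance of these plots can be tweaked with keyword arguments such as figsize, which the next challenge makes use of:
In [ ]:
# figsize=(width, height) sets the figure size in inches
total_count.plot(kind='bar', figsize=(10, 4));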
Challenge - Plots
- Create a plot of average bmi across all years per country (to resize your graph, try using the keyword figsize=(width, height), where the values are in inches). Can you determine which country has the highest average bmi? Which has the lowest? To sort a DataFrame by the values in a particular column, you can use df.sort_values(column_name).
- Create a plot of average male versus average female obesity prevalence for the entire dataset.
In [ ]:
df_surveys.groupby("country")['bmi'].mean().sort_values()
Summary Plotting Challenge
Create a stacked bar plot, with obesity prevalence on the Y axis, and the stacked variable being sex. The plot should show mean obesity prevalence by sex for each year. Some tips are below to help you solve this challenge:
- For more on Pandas plots, see the Pandas plotting documentation.
- You can use the code that follows to create a stacked bar plot, but the data to stack need to be in individual columns. Here's a simple example with some data where 'a', 'b', and 'c' are the groups, and 'one' and 'two' are the subgroups.
In [ ]:
d = {'one': pd.Series([1., 2., 3.], index=['a', 'b', 'c']),
     'two': pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd'])}
pd.DataFrame(d)
We can plot the above with
In [ ]:
# Plot stacked data so columns 'one' and 'two' are stacked
my_df = pd.DataFrame(d)
my_df.plot(kind='bar', stacked=True, title="The title of my graph")
- You can use the .unstack() method to transform grouped data into separate columns for plotting. Try running .unstack() on some DataFrames above and see what it yields.
Start by transforming the grouped data (by year and sex) into an unstacked layout, then create a stacked plot.
In [ ]:
In [ ]:
In [ ]:
In [ ]:
First we group the data by year and by sex, and then calculate the mean obesity prevalence for each group.
In [ ]:
by_year_sex = df_surveys.groupby(['year','sex'])
year_sex_mean = by_year_sex['obese_pc'].mean()
This calculates the mean obesity prevalence for each sex within each year, as a table.
In [ ]:
year_sex_mean.head()
Below we'll use .unstack() on our grouped data to lay out the mean obesity prevalence for each sex as a separate column, with one row per year.
In [ ]:
spc = year_sex_mean.unstack()
spc
Now, create a stacked bar plot with that data, where the obesity prevalence for each sex is stacked by year.
Rather than displaying it as a table, we can plot the above data by stacking the values of each sex as follows:
In [ ]:
s_plot = spc.plot(kind='bar', stacked=True, title="Mean obesity prevalence by year and sex")
s_plot.set_ylabel("Prevalence of obesity")
s_plot.set_xlabel("Year")
In [ ]: